Categories and curves

Workspace setup:

data("Howell1")
d <- Howell1
d$sex <- ifelse(d$male == 1, 2, 1) # 1 = female, 2 = male
head(d[, c("male", "sex")])
  male sex
1    1   2
2    0   1
3    0   1
4    1   2
5    0   1
6    1   2
As we develop more useful models, we’ll begin to practice the art of generating models with multiple estimands. An estimand is a quantity we want to estimate from the data. Our models may not themselves produce the answer to our central question, so we need to know how to calculate these values from the posterior distributions.
This is going to be different from prior regression courses (PSY 612), where our models were often designed to give us precisely what we wanted. For example, consider the regression:
\[ \hat{Y} = b_0 + b_1(D) \] Where \(Y\) is a continuous outcome and \(D\) is a dummy coded variable (0 = control; 1 = treatment).
Forget dummy codes. From here on out, we will incorporate categorical causes into our models by using index variables. An index variable contains integers that correspond to different categories. The numbers have no inherent meaning – rather, they stand as placeholders or shorthand for categories.
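As a small illustration with hypothetical data (not from Howell1), here is the difference between dummy coding and index coding of the same categorical variable:

```r
# Hypothetical example: two ways to encode a two-level category
group <- c("control", "treatment", "treatment", "control")

# Dummy coding: a 0/1 indicator for one category
d_dummy <- ifelse(group == "treatment", 1, 0)

# Index coding: integers 1..k, one per category, with no implied ordering
d_index <- as.integer(factor(group))  # control = 1, treatment = 2

d_dummy  # 0 1 1 0
d_index  # 1 2 2 1
```

With an index variable, each category gets its own parameter (e.g., one intercept per category), rather than one category serving as a baseline.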
Let’s write a mathematical model to express height in terms of sex.
\[\begin{align*} h_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{SEX[i]} \\ \alpha_j &\sim \text{Normal}(178, 20)\text{ for }j = 1..2 \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]
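A sketch of fitting this model with `quap()`. The model name `m1` is inferred from the posterior output later in this section; the priors mirror the mathematical model above.

```r
# Sketch: height with one intercept per sex, via an index variable
m1 <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a[sex],
    a[sex] ~ dnorm(178, 20),   # one alpha per sex category
    sigma ~ dunif(0, 50)
  ),
  data = d
)
```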
precis(m1, depth = 2)
           mean        sd      5.5%     94.5%
a[1]  134.90835 1.6068669 132.34026 137.47643
a[2]  142.57862 1.6973997 139.86585 145.29140
sigma  27.30881 0.8279547  25.98558  28.63204
Here, we are given the estimates of the parameters specified in our model: the average height of women (a[1]) and the average height of men (a[2]). But our question is whether these average heights are different. How do we get that?
post <- extract.samples(m1, n = 1e4)
str(post)
List of 2
 $ sigma: num [1:10000] 27.4 27 28.1 25.9 28.5 ...
 $ a    : num [1:10000, 1:2] 134 137 137 134 138 ...
 - attr(*, "source")= chr "quap posterior: 10000 samples from m1"
head(post$a)
[,1] [,2]
[1,] 134.1883 140.5603
[2,] 137.0044 142.3574
[3,] 137.0620 143.7545
[4,] 134.4854 140.9569
[5,] 138.3727 142.2990
[6,] 136.1427 142.0852
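The answer is a contrast: for every posterior sample, subtract one intercept from the other. A sketch, with the contrast named `diff_fm` to match the summary that follows:

```r
# Compute the female-minus-male contrast for every posterior sample
post$diff_fm <- post$a[, 1] - post$a[, 2]
precis(post, depth = 2)
```

Because the subtraction happens sample by sample, `diff_fm` is itself a full posterior distribution, with its own mean and compatibility interval.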
mean sd 5.5% 94.5% histogram
sigma 27.306050 0.8326926 25.98662 28.652630 ▁▁▁▁▃▇▇▇▃▂▁▁▁
a[1] 134.897914 1.6043493 132.32729 137.407035 ▁▁▁▁▂▅▇▇▅▂▁▁▁
a[2] 142.585173 1.6935561 139.88835 145.292251 ▁▁▁▁▁▃▇▇▇▃▂▁▁▁▁
diff_fm -7.687258 2.3322585 -11.39943 -4.003179 ▁▁▁▃▇▇▃▁▁▁
We can create two plots: one showing the posterior distributions of average female and male heights, and one showing the posterior distribution of the average difference.
p1 <- post %>% as.data.frame() %>%
pivot_longer(starts_with("a")) %>%
mutate(sex = ifelse(name == "a.1", "female", "male")) %>%
ggplot(aes(x=value, color = sex)) +
geom_density(linewidth = 2) +
labs(x = "height(cm)")
p2 <- post %>% as.data.frame() %>%
ggplot(aes(x=diff_fm)) +
geom_density(linewidth = 2) +
labs(x = "difference in height(cm)")
(p1 | p2)
Note that the distribution of the mean heights is not the same as the distribution of heights themselves. For that, we need the posterior predictive distribution.
pred_f <- rnorm(1e4, mean = post$a[,1], sd = post$sigma )
pred_m <- rnorm(1e4, mean = post$a[,2], sd = post$sigma )
pred_post = data.frame(pred_f, pred_m) %>%
mutate(diff = pred_f-pred_m)
# plot distributions
p1 <- pred_post %>% pivot_longer(starts_with("pred")) %>%
mutate(sex = ifelse(name == "pred_f", "female", "male")) %>%
ggplot(aes(x = value, color = sex)) +
geom_density(linewidth = 2) +
labs(x = "height (cm)")
# plot difference
# Compute density first
density_data <- density(pred_post$diff)
# Convert to a tibble for plotting
density_df <- tibble(
x = density_data$x,
y = density_data$y,
fill_group = ifelse(x < 0, "male", "female") # Define fill condition
)
# Plot with area fill
p2 <- ggplot(density_df, aes(x = x, y = y, fill = fill_group)) +
geom_area() + # Adjust transparency if needed
geom_line(linewidth = 1.2, color = "black") + # Keep one continuous curve
labs(x = "Difference in height (F-M)", y = "density") +
guides(fill = "none")
(p1 | p2)
In the rethinking package, the dataset milk contains information about the composition of milk across primate species, as well as some other facts about those species. The taxonomic membership of each species is included in the variable clade; there are four categories.
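A minimal load of the data, assuming the rethinking package is attached:

```r
# Load the primate milk data bundled with the rethinking package
library(rethinking)
data(milk)
levels(milk$clade)  # the four taxonomic categories
```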
The outcome we will model is the energy density of milk (kcal.per.g).
str(milk)
'data.frame':	29 obs. of  8 variables:
$ clade : Factor w/ 4 levels "Ape","New World Monkey",..: 4 4 4 4 4 2 2 2 2 2 ...
$ species : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 8 9 10 16 2 1 6 28 27 ...
$ kcal.per.g : num 0.49 0.51 0.46 0.48 0.6 0.47 0.56 0.89 0.91 0.92 ...
$ perc.fat : num 16.6 19.3 14.1 14.9 27.3 ...
$ perc.protein : num 15.4 16.9 16.9 13.2 19.5 ...
$ perc.lactose : num 68 63.8 69 71.9 53.2 ...
$ mass : num 1.95 2.09 2.51 1.62 2.19 5.25 5.37 2.51 0.71 0.68 ...
$ neocortex.perc: num 55.2 NA NA NA NA ...
\[\begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\text{CLADE}[i]} \\ \alpha_j &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \sigma &\sim \text{Exponential}(1) \end{align*}\]
Exercise: Now fit your model using quap(). It’s ok if your mathematical model is a bit different from mine.
# K must be standardized, and clade converted to an integer index,
# before fitting (the precis() output is on the standardized scale)
milk$K <- standardize(milk$kcal.per.g)
milk$clade_id <- as.integer(milk$clade)
flist <- alist(
  K ~ dnorm( mu , sigma ) ,
  mu <- a[clade_id] ,
  a[clade_id] ~ dnorm( 0 , 0.5 ) ,
  sigma ~ dexp( 1 )
)
m2 <- quap(
  flist, data = milk
)
precis( m2, depth=2 )
           mean         sd       5.5%      94.5%
a[1] -0.4843464 0.21764156 -0.83217966 -0.1365132
a[2] 0.3662511 0.21705924 0.01934847 0.7131537
a[3] 0.6752371 0.25753382 0.26364833 1.0868259
a[4] -0.5858007 0.27450960 -1.02452009 -0.1470814
sigma 0.7196466 0.09653382 0.56536687 0.8739263
Plot the following distributions:
post <- extract.samples( m2 )
labels <- levels(milk$clade)
names(labels) <- paste("a.", 1:4, sep = "")
post %>% as.data.frame() %>%
pivot_longer(starts_with("a")) %>%
mutate(name = recode(name, !!!labels)) %>%
ggplot(aes(x = value, color = name)) +
geom_density(linewidth = 2) +
labs(title = "Posterior distribution of expected milk energy")
post <- extract.samples( m2 )
a.1 = rnorm(1e4, post$a[,1], post$sigma)
a.2 = rnorm(1e4, post$a[,2], post$sigma)
a.3 = rnorm(1e4, post$a[,3], post$sigma)
a.4 = rnorm(1e4, post$a[,4], post$sigma)
data.frame(a.1, a.2, a.3, a.4) %>%
pivot_longer(everything()) %>%
mutate(name = recode(name, !!!labels)) %>%
ggplot(aes(x = value, color = name)) +
geom_density(linewidth = 2) +
labs(title = "Posterior distribution of predicted milk energy")
Let’s return to the height example. What if we want to control for weight?
\[\begin{align*} h_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{S[i]} + \beta_{S[i]}(W_i-\bar{W})\\ \alpha_j &\sim \text{Normal}(178, 20)\text{ for }j = 1..2 \\ \beta_j &\sim \text{Normal}(0, 5)\text{ for }j = 1..2 \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]
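A sketch of fitting this model. The names `m3`, `sex`, `weight`, and `Wbar` are inferred from the `link()` calls and posterior output later in the section; treat the exact call as an assumption.

```r
# Sketch: sex-specific intercepts and slopes for height on centered weight
dat <- list(
  height = d$height,
  weight = d$weight,
  sex    = d$sex,
  Wbar   = mean(d$weight)  # centering constant, passed in as data
)
m3 <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a[sex] + b[sex] * (weight - Wbar),
    a[sex] ~ dnorm(178, 20),
    b[sex] ~ dnorm(0, 5),
    sigma ~ dunif(0, 50)
  ),
  data = dat
)
```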
precis(m3, depth = 2)
                mean         sd       5.5%      94.5%
a[1] 138.565639 0.55777123 137.674213 139.457066
a[2] 138.162973 0.58854668 137.222361 139.103584
b[1] 1.805476 0.04138462 1.739335 1.871616
b[2] 1.735288 0.03688954 1.676332 1.794245
sigma 9.330419 0.28287315 8.878333 9.782505
post <- extract.samples(m3, n = 1e4)
str(post)
List of 3
$ sigma: num [1:10000] 9.31 9.07 9.15 9.37 9.46 ...
$ a : num [1:10000, 1:2] 139 138 139 139 138 ...
$ b : num [1:10000, 1:2] 1.85 1.79 1.84 1.82 1.77 ...
- attr(*, "source")= chr "quap posterior: 10000 samples from m3"
Plot the slopes using extract.samples()
xbar = mean(d$weight)
plot(NULL, xlim = range(d$weight), ylim = c(0, 200),
xlab = "weight", ylab = "height")
#plot each line
for(i in 1:50){
curve(post$a[i, 1] +post$b[i, 1]*(x-xbar),
add = T,
col=col.alpha("#1c5253",0.1))
curve(post$a[i, 2] +post$b[i, 2]*(x-xbar),
add = T,
col=col.alpha("#e07a5f",0.1))
}
Plot the slopes using link()
xseq <- seq( min(d$weight), max(d$weight), len=100)
plot(NULL, xlim = range(d$weight), ylim = range(d$height), xlab = "weight", ylab = "height")
muF <-
link(m3, data=list(sex=rep(1,100), weight=xseq, Wbar = mean(d$weight)))
lines(xseq, apply(muF, 2, mean), lwd = 2, col = "#1c5253" )
muM <-
link(m3, data=list(sex=rep(2,100), weight=xseq, Wbar = mean(d$weight)))
lines(xseq, apply(muM, 2, mean), lwd = 2, col = "#e07a5f")
Return to the milk data. Write a mathematical model expressing the energy of milk as a function of the species body mass (mass) and clade category. Be sure to include priors. Fit your model using quap().
\[\begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\text{CLADE}[i]} + \beta_{\text{CLADE}[i]}(M_i-\bar{M})\\ \alpha_j &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \beta_j &\sim \text{Normal}(0, 0.5) \text{ for }j=1..4 \\ \sigma &\sim \text{Exponential}(1) \end{align*}\]
dat <- list(
  K = standardize(milk$kcal.per.g),
  M = milk$mass,
  Mbar = mean(milk$mass),
  clade_id = as.integer(milk$clade)
)
flist <- alist(
K ~ dnorm( mu , sigma ) ,
mu <- a[clade_id] +b[clade_id]*(M-Mbar),
a[clade_id] ~ dnorm( 0 , 0.5 ) ,
b[clade_id] ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
)
m4 <- quap(
  flist, data = dat
)
precis(m4, depth = 2)
                 mean          sd         5.5%       94.5%
a[1] -0.434377353 0.261135975 -0.851723077 -0.017031628
a[2] -0.282548381 0.478559519 -1.047378922 0.482282160
a[3] 0.369109770 0.418895104 -0.300365511 1.038585051
a[4] -0.006397850 0.498109259 -0.802472651 0.789676950
b[1] -0.002675976 0.007184096 -0.014157549 0.008805597
b[2] -0.061283582 0.040138478 -0.125432623 0.002865459
b[3] -0.047906928 0.050280500 -0.128264878 0.032451021
b[4] 0.064810066 0.046233643 -0.009080225 0.138700357
sigma 0.692582212 0.092048312 0.545471231 0.839693194
xseq <- seq( min(milk$mass), max(milk$mass), len=100)
Mbar = mean(milk$mass)
custom_colors = c("#1c5253", "#e07a5f", "#f2cc8f", "#81b29a")
colors = custom_colors[milk$clade_id]
plot(dat$K ~ dat$M, col = colors,
     pch = 16,
     xlim = range(dat$M), ylim = range(dat$K),
     xlab = "body mass", ylab = "kcal per g (standardized)")
mu1 <-
link(m4, data=list(clade_id=rep(1,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu1, 2, mean), lwd = 2, col = "#1c5253" )
mu2 <-
link(m4, data=list(clade_id=rep(2,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu2, 2, mean), lwd = 2, col = "#e07a5f" )
mu3 <-
link(m4, data=list(clade_id=rep(3,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu3, 2, mean), lwd = 2, col = "#f2cc8f" )
mu4 <-
link(m4, data=list(clade_id=rep(4,100), M=xseq, Mbar = Mbar))
lines(xseq, apply(mu4, 2, mean), lwd = 2, col = "#81b29a" )
legend("topright", legend = levels(milk$clade),
       col = custom_colors, pch = 16)
Now for the curves. The cherry_blossoms data in the rethinking package record the day of year (doy) of first cherry blossom, along with temperature records, spanning more than a thousand years.
data(cherry_blossoms)
d <- cherry_blossoms
precis(d)
              mean          sd      5.5%      94.5%  histogram
year 1408.000000 350.8845964 867.77000 1948.23000 ▇▇▇▇▇▇▇▇▇▇▇▇▁
doy 104.540508 6.4070362 94.43000 115.00000 ▁▂▅▇▇▃▁▁
temp 6.141886 0.6636479 5.15000 7.29470 ▁▃▅▇▃▂▁▁
temp_upper 7.185151 0.9929206 5.89765 8.90235 ▁▂▅▇▇▅▂▂▁▁▁▁▁▁▁
temp_lower 5.098941 0.8503496 3.78765 6.37000 ▁▁▁▁▁▁▁▃▅▇▃▂▁▁▁
psych::describe(d)
           vars    n    mean     sd  median trimmed    mad    min     max
year 1 1215 1408.00 350.88 1408.00 1408.00 450.71 801.00 2015.00
doy 2 827 104.54 6.41 105.00 104.54 5.93 86.00 124.00
temp 3 1124 6.14 0.66 6.10 6.11 0.61 4.67 8.30
temp_upper 4 1124 7.19 0.99 7.04 7.10 0.92 5.45 12.10
temp_lower 5 1124 5.10 0.85 5.14 5.10 0.72 0.75 7.74
range skew kurtosis se
year 1214.00 0.00 -1.20 10.07
doy 38.00 0.00 -0.15 0.22
temp 3.63 0.40 0.11 0.02
temp_upper 6.65 1.05 1.71 0.03
temp_lower 6.99 -0.17 1.88 0.03
d2 <- d[ complete.cases(d$doy) , ] # complete cases on doy
num_knots <- 15
knot_list <- quantile( d2$year , probs=seq(0,1,length.out=num_knots) )
knot_list
       0%  7.142857% 14.28571% 21.42857% 28.57143% 35.71429% 42.85714%       50%
      812       1036      1174      1269      1377      1454      1518      1583
57.14286% 64.28571% 71.42857% 78.57143% 85.71429% 92.85714%      100%
     1650      1714      1774      1833      1893      1956      2015
\[\begin{align*} D_i &\sim \text{Normal}(\mu_i,\sigma) \\ \mu_i &= \alpha + \sum^K_{k=1} w_k B_{k,i} \\ \alpha &\sim \text{Normal}(100,10) \\ w_j &\sim \text{Normal}(0,10) \\ \sigma &\sim \text{Exponential}(1) \end{align*}\]
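The basis functions \(B_{k,i}\) in the model above can be built with `bs()` from the splines package, using the knots computed from `d2$year`. A sketch, following the standard cubic B-spline construction (an assumption about this document's next step):

```r
# Sketch: cubic B-spline basis over year, one row per observation,
# one column per basis function. Boundary knots are dropped from the
# interior-knot list because bs() adds them itself.
library(splines)
B <- bs(d2$year,
        knots = knot_list[-c(1, num_knots)],  # interior knots only
        degree = 3, intercept = TRUE)
dim(B)  # rows = observations in d2, columns = basis functions
```

Each weight \(w_k\) then scales one column of B, and the sum of the weighted columns traces out the curve.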